Complete GFW Rulebook for Wikipedia, Plus Comprehensive List for Websites, IPs, IMDB and AppStore
- by
- Xia Chu
- Publication date
- 2013-12-25
- Topics
- Internet censorship, China, GFW, Great Firewall of China, Wikipedia, IMDB, App Store
- Collection
- opensource
- Language
- English
- Item Size
- 51.5M
In this report, we detail the complete and exact rulebook that the Great Firewall of China (GFW) enforces against Wikipedia. We call it a “rulebook” (instead of the common term “blacklist”) because we identify not only the blacklisted terms but also the exact string matching rules deployed by GFW. An efficient probing methodology makes this possible.
GFW blocked Wikipedia outright in the early years but gradually loosened the block, first by unblocking all non-Chinese versions, then by unblocking the Chinese version except for certain entries deemed harmful by the Chinese authorities.
There have been some efforts to understand the Wikipedia blacklist. For example, at the time of writing, the site Greatfire.org tracks about 700 Wikipedia pages, of which about 400 are claimed to be blocked or partially blocked in China.
Wikipedia contains millions of pages: more than 700,000 articles in the Chinese version and more than 4,240,000 articles in the English version. Testing these pages exhaustively seems a daunting and infeasible task, hence there has been no well-known attempt to gather the complete blacklist.
While a small sample of the blacklist is useful, the complete picture can be much more powerful in revealing the inner workings of GFW and its operators. In this study, we devised a methodology that efficiently examines the entire Wikipedia corpus, exposing to the world the complete GFW rulebook for Wikipedia for the first time. In total, 936 rules (excluding website URL terms) apply to Wikipedia, affecting 5340 pages in Chinese Wikipedia and 67 pages in English Wikipedia.
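This summary does not spell out the probing methodology. As a rough illustration of how a large corpus could be examined with far fewer probes than pages, the sketch below uses batch probing with bisection (group testing). It is not taken from the report: it assumes that a probe carrying several terms is blocked exactly when at least one term in it is blocked, and `find_blocked`, `make_simulated_probe`, and the simulated blacklist are hypothetical names introduced only for this example. No real network traffic is involved.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


def find_blocked(
    terms: Sequence[str],
    is_blocked: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Return the terms that trigger blocking, using batch probes.

    Assumption (not from the report): a batch is blocked if and only if
    at least one term in it is blocked.
    """
    if not terms or not is_blocked(terms):
        return []  # a single clean probe clears the whole batch
    if len(terms) == 1:
        return [terms[0]]
    mid = len(terms) // 2
    # The batch is blocked: split it and examine each half separately.
    return find_blocked(terms[:mid], is_blocked) + find_blocked(terms[mid:], is_blocked)


def make_simulated_probe(
    blacklist: Set[str],
) -> Tuple[Callable[[Sequence[str]], bool], Dict[str, int]]:
    """Build a stand-in for a real probe, plus a counter of probes issued."""
    calls = {"n": 0}

    def probe(batch: Sequence[str]) -> bool:
        calls["n"] += 1
        # Simulated substring rule: any blacklisted string inside any term.
        return any(bad in term for term in batch for bad in blacklist)

    return probe, calls


# Hypothetical corpus: 1024 harmless titles plus two that match a rule.
titles = ["title_%d" % i for i in range(1024)]
titles.insert(100, "forbidden_a")
titles.insert(900, "forbidden_b")

probe, calls = make_simulated_probe({"forbidden"})
found = find_blocked(titles, probe)
print(found, "found with", calls["n"], "probes for", len(titles), "titles")
```

When blocked terms are rare, as the page counts above suggest, the number of probes grows roughly with the number of blocked terms times the logarithm of the corpus size rather than with the corpus size itself.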
Furthermore, using this methodology, we examined more than a million website names (obtained from Alexa and several online lists of sites blocked by GFW). We identified 3644 GFW filtering rules targeting website names. This list is significantly more comprehensive and more precise than any previous list. We also applied the methodology to IMDB (4M titles examined, 6 rules identified), a large repository of AppStore apps (648,567 items examined, 26 rules identified), and many IP strings (786,432 IPs examined, 130 rules identified).
The revealed rulebook demonstrates that the GFW operation is haphazard and ill-maintained. The GFW filtering rules are like a cesspool. At the same time, the Chinese censorship bureaucracy intends to be thorough and extensive.
We created a monitoring pipeline for Wikipedia, which checks whether GFW adds any new rules against Wikipedia.
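The pipeline itself is not described here. One simple way such monitoring could work is to compare the rule set found in the latest run against the previous snapshot; the sketch below assumes rules are stored as plain strings, and `diff_rules` and the sample rule names are hypothetical placeholders, not the report's implementation or data.

```python
from typing import Dict, List, Set


def diff_rules(previous: Set[str], current: Set[str]) -> Dict[str, List[str]]:
    """Compare two rule snapshots and report additions and removals."""
    return {
        "added": sorted(current - previous),    # rules seen only in the new run
        "removed": sorted(previous - current),  # rules that no longer trigger
    }


# Placeholder snapshots; real rule strings would come from the probing runs.
yesterday = {"rule_a", "rule_b"}
today = {"rule_b", "rule_c"}
changes = diff_rules(yesterday, today)
print(changes)
```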
All findings in this report, plus new updates, are recorded in a master spreadsheet located at goo.gl/zKslcu. I will also send new updates (e.g. additions or removals of rules, or other changes to GFW) to summeragony@googlegroups.com. Interested parties can send an email to summeragony+subscribe@googlegroups.com to subscribe.
Notes
Original source: https://docs.google.com/file/d/0B8ztBERe_FUwLWxUX0laeWF3aE0/edit
- Addeddate
- 2018-09-04 03:12:25
- Identifier
- GFW_Wikipedia_V3.0
- Identifier-ark
- ark:/13960/t4jm9fn49
- Ocr
- ABBYY FineReader 11.0 (Extended OCR)
- Pages
- 81
- Ppi
- 300
- Scanner
- Internet Archive Python library 1.7.7
- Version
- 3.0
IN COLLECTIONS
Community Texts
Uploaded by David Fifield on